Abstract

An artificial neural network is presented based on the idea of connections
between units that are active only for a specific range of input values and
zero outside that range (and so are not evaluated outside the active range).
The connection function is represented by a polynomial with compact
support. The finite range of activation allows for great activation sparsity
in the network and means that, in theory, computational power can be added
to the network without increasing the computational time required to
evaluate the network for a given input. The polynomial order ranges from
first to fifth order. Unit dropout is used for regularization and a
parameter-free weight update is used. Better performance is obtained by
moving from piecewise linear connections to piecewise quadratic, and even
better performance can be obtained by moving to higher order polynomials.
The algorithm is tested on the MAGIC gamma ray data set as well as the
MNIST data set.

Keywords: artificial neural network, piecewise polynomial, discontinuous,
high order, autoencoder, FLANN


1. Introduction 

Neural networks and their application to deep learning have become a
focus of research due to the success of the algorithms on several types of
problems. The goal of this work is to create a multi-layer artificial
neural network where connections between units are active only in a finite
range. This naturally leads to grouping connections together into functional
links that span the full range of input values. Each connection is then
called a sub link of the complete link. Each sub link is represented using a
piecewise polynomial with compact support. The resulting link looks like a
one-dimensional grid of discontinuous piecewise polynomials.

The motivation for this work comes from (1) the knowledge that
activation of functional elements in biological neural networks is extremely
sparse; (2) the idea that fewer layers in deep neural networks are required
if one increases the computational capability of each layer; (3) the fact that
the use of piecewise quadratic or higher order polynomials in an ANN
results in increased polynomial order at the output, whereas the output of a
piecewise linear network remains piecewise linear regardless of the number
of layers; (4) the observation that increasing the sparsity (adding
additional sub links) can theoretically increase computational power without
increasing computational time; (5) the fact that higher order approximations
are known to reduce the problem of adversarial examples [5]; and (6) evidence
that dendrites (dendritic spines) are not just passive elements, but perform
computations themselves [33].

The algorithm starts with a simple technique for approximating
functions using piecewise discontinuous polynomials. Typically, discontinuous
functions are avoided in artificial neural networks because the derivatives
used by gradient descent are not defined at the discontinuities. In this
paper, gradient descent is used while alternative algorithms are still to be
developed; one possible approach is recursive decomposition as described by
Friesen in [8]. The discontinuities act to break up the network into a series
of sub networks. Gradient descent is applied to each of these sub networks
separately and good results are achieved. The fact that this works is not
entirely surprising given that the dropout technique [12, 32] has the same
effect on a network during training and is equivalent to training on a
discontinuous network. To see this, evaluate a network with half the hidden
neurons dropped out and record the result. Then select another random set of
neurons to drop out and again evaluate the network with the same input. The
outputs from the two cases will differ, so the network is discontinuous
during training as a result of dropout. Note that at least one discontinuity
per link should be used to remain a universal approximator as described by
[20]. The discontinuous network tends to produce over-fitted results in many
problems. Over-fitting is resolved by using the dropout regularization
technique described by [12, 32]. In addition, a parameter-free weight update
method as described by [30] is used to reduce the parameter search space.
Lagrange polynomials are used to describe the piecewise polynomial functions
and, as such, the weights of the polynomial are the actual values of the
polynomial at specific locations.

In this paper, the link is the important computational element, but this
approach can just as well be applied to the unit. Piecewise polynomial
approaches have been investigated in the CMAC architecture [16], and more
recently the rectified linear unit [23], which is piecewise linear, has
become a popular activation function. High order neural networks using
higher order weighting terms are described by several authors, and
functional link artificial neural networks (FLANN), the approach used in this
paper, are described by [25, 26, 28]. Discontinuous neural networks have
been discussed in many articles, especially with respect to recurrent neural
networks, including work focused on convergence, state estimation and
stability, and by [3S], where a unique recurrent high order algorithm is
derived for financial modeling, but where additional free parameters
(weights) are added to the unit and a few simulations are performed with
piecewise high order elements. There is very little work on multi-layer high
order discontinuous polynomial networks. In this paper the algorithm is
tested by performing simple curve fitting of a sine wave using a single link,
and then classification with the MAGIC gamma ray detection data set and the
MNIST data set. The MNIST test is performed using multiple autoencoders.
The unique contribution of this paper is the development and application of
a novel algorithm, using discontinuous piecewise polynomial approximation, in
a multi-layer neural network, as well as the use of the parameter-free weight
update described by [30]. The algorithm opens the possibility of using a
variety of complicated, discontinuous elements in an artificial neural
network trained with back propagation.

Section 2 describes the algorithm used in this paper, including
backpropagation with discontinuities, weight initialization and link input
range selection. In Section 3 we demonstrate the algorithm on 3 problems: a
simple 1D function approximation; the MAGIC gamma ray data set; and the
MNIST data set. The final section concludes the paper.

2. Algorithm 

Definitions used in this paper are given in Table 1. There are two natural
ways to apply higher order approximations in artificial neural networks.
The first is to replace the single weight at the link with multiple adjustable
weights describing a more complicated link function; this is a functional link
neural network (FLANN) as described by [25] and demonstrated by [27].
The alternative is to add adjustable parameters to the unit that describe a
changing activation function; in this case adjustable parameters exist in both
the unit and the link. In this paper we chose the FLANN approach as it can
be written slightly more compactly while maintaining weights defined at the
link.

    Link: the connection between two units.
    Sub link: a link is split into sub links.
    The weight of a sub link, the weight of the link.
    $B_j$: the basis function of a sub link.
    $N_{in}$: the number of active links into a unit.
    $N_{out}$: the number of active links out of a unit.
    $N_p$: the number of Chebyshev-Lobatto nodes.
    The input to the link.
    The Chebyshev-Lobatto node.
    The measured output value for output link i.
    The desired output value for output link i.
    The output function for link i.
    The output for link i.
    $E$: the network error for a single input.
    Time to completion using sub links with $N_p$ nodes.

Table 1: Variable definitions used in this paper.

The error correction algorithm used is backpropagation applied to a
FLANN, with a minor modification described in Section 2.1. The weight
update is the parameter-free weight update rule described by [30], with the
slight modification that the maximum learning rate is set to 0.9. The network
is a standard feed forward network (see Figure 1) with input links and output
links added. Labels for a network element are shown in Figure 2. The main
differences of the algorithm compared to a standard network [29] are: (1)
there are multiple weights per link; (2) no bias units are used; (3) the unit
averages the input signals to produce an output instead of applying a more
complex activation function; (4) input/output links are added which can be
used to normalize and shift the data to the desired input/output range.

The weight function of a sub link is described by the following equation.

$$f(x) = \sum_{j=0}^{N_p-1} w_j B_j(x) \qquad (1)$$

with the basis functions given by the Lagrange polynomials

$$B_j(x) = \prod_{\substack{0 \le m \le N_p-1 \\ m \ne j}} \frac{x - x_m}{x_j - x_m}$$

The Lagrange polynomial $B_j$ is useful because it has the value 1 at $x = x_j$
and has the value 0 at all other $x_i$. The interpolation nodes are given by the
Chebyshev-Lobatto nodes; these differ from pure Chebyshev nodes in that
the end points of the domain are included. The Chebyshev-Lobatto nodes
are given by

$$x_k = -\cos\left(\frac{k\pi}{N_p - 1}\right)$$

where $k$ ranges from 0 to $N_p - 1$ and $N_p$ is the total number of
Chebyshev-Lobatto points. This means that in Equation (1) the value of the
function at $x_j$ is $w_j$. Using this approach we can easily limit the range
of a polynomial interpolation by limiting the range of the weights $w_i$. For
clarity we provide the polynomials for both a linear and a quadratic
interpolation. In the linear case (assuming $x_0 = -1$ and $x_1 = 1$)

$$B_0 = -\tfrac{1}{2}(x - 1)$$

$$B_1 = \tfrac{1}{2}(x + 1)$$

In the quadratic case we assume $x_0 = -1$, $x_1 = 0$, $x_2 = 1$ and the basis
functions are

$$B_0 = \tfrac{1}{2}x(x - 1)$$

$$B_1 = -(x + 1)(x - 1)$$

$$B_2 = \tfrac{1}{2}(x + 1)x$$

In addition to the high order weighting, discontinuities are used along
with the piecewise polynomial approximation. This means that the function
is different depending on the range of the input variable $x$. In particular we
have the definition

$$F_i(x_i) = \begin{cases}
f_{i,0}(x_i) & \text{if } r_{\min} \le x_i < a_0 \\
f_{i,1}(x_i) & \text{if } a_0 \le x_i < a_1 \\
\quad\vdots & \\
f_{i,n}(x_i) & \text{if } a_{n-1} \le x_i \le r_{\max}
\end{cases} \qquad (2)$$
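
The piecewise definition can be illustrated with the sketch below. It rests on assumptions not stated in the text: sub link ranges are equally spaced, each range is mapped linearly onto the local coordinate $[-1, 1]$ of its polynomial, and a boundary point belongs to the sub link on its right. The names `active_sub_link` and `link_forward` are hypothetical.

```python
import numpy as np

def active_sub_link(x, r_min, r_max, n_sub):
    """Index of the equally spaced sub link whose range contains x."""
    width = (r_max - r_min) / n_sub
    idx = int((x - r_min) // width)
    # Clamp so that x == r_max falls in the last range.
    return min(max(idx, 0), n_sub - 1)

def link_forward(x, weights, r_min, r_max):
    """Evaluate F(x) for a link whose weights have shape (n_sub, n_p)."""
    n_sub, n_p = weights.shape
    nodes = -np.cos(np.pi * np.arange(n_p) / (n_p - 1))
    width = (r_max - r_min) / n_sub
    s = active_sub_link(x, r_min, r_max, n_sub)
    # Map x to the local coordinate t in [-1, 1] of the active sub link.
    t = 2.0 * (x - (r_min + s * width)) / width - 1.0
    basis = np.array([
        np.prod([(t - nodes[m]) / (nodes[j] - nodes[m])
                 for m in range(n_p) if m != j])
        for j in range(n_p)
    ])
    # Only the weights of the active sub link are touched.
    return s, float(weights[s] @ basis)
```

With two linear sub links on $[-1, 1]$ and weights `[[0, 1], [5, 6]]`, the value just left of 0 approaches 1 while the value at 0 is 5, which is the kind of jump the definition allows at a sub link boundary.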

At the unit, the average of the incoming signals is computed. The unit
could be a sigmoid, but this is not needed since the non-linearity is provided
by the presence of at least one discontinuity, so only a simple average is used
as the activation function of the unit.

$$x_i = \frac{1}{N_{in}} \sum_{j=1}^{N_{in}} y_j$$

The averaging is important because the function $F_i$ is only defined on a
finite range, so it is important to guarantee that any signal passed to $F_i$
is within that range. This can easily be accomplished by choosing $r_{\min}$
and $r_{\max}$ correctly. $N_{in}$ is the number of input signals (not the total
number of input sub links).

The idea to use a discontinuous piecewise polynomial comes from a pair
of units with multiple links between the units, see Figure 3. Only a single
link is active depending on the output value of the unit. Equation (2) can
be described as a set of links between two units where only one link passes
an output signal for a given input signal. As a result, the links are grouped
together in a bundle (Equation (2), shown in Figure 4). The standard link
of a neural network described in this paper then consists of one or more sub
links. The link function can now be thought of as a one-dimensional grid
with piecewise polynomial elements within each grid cell, as shown in
Figure 5.

Figure 1: A standard feed forward network. Blue circles represent summing units and
green lines represent links where the signal passes from one neuron to the next along the
direction of the arrow.

Figure 2: Zoom in of connected units.

Figure 3: Two units are connected by several links. Each link is only active for a range of
possible inputs. These links can be combined into a single link with multiple weights and
discontinuities between sub links.

Figure 4: A pair of units connected by a single link with 3 sub links where only one sub
link is active for a given input signal. This grouping of links is called a “bundle”. This is
the simplified approach taken in this paper.

Figure 5: A link can be represented as a grid with a piecewise function valid on each of
the ranges of the sub links. The link function is discontinuous at the ends of each range.
Notice that in this link the sub link ranges are not equally spaced, and they don’t need
to be. For the purposes of this paper, all ranges are equally spaced.

2.1. Backpropagation with Discontinuities

A network using a link with at least two sub links has the following
property: for a given input signal only a subset of the network is active.
When training is performed, only the weights of the activated sub links are
updated. Figure 6 shows a 3-layer network consisting of two links, each link
with two sub links. Figure 7 shows this same network with all the possible
paths that a signal can take from input to output. It is evident from Figure 7
that there are 4 networks, where each network shares some weights with the
other networks. It is easy to over-train this type of network since each signal
only affects a subset of the weights. To resolve over-training, the “dropout”
regularization technique is used as described in [12].

The key to getting an algorithm working with functions that are
discontinuous is a backpropagation algorithm that works for discontinuous
functions. The implementation is simple: the only requirement is to remove
idle sub links from the back propagation step. After a signal has been
propagated through the network, back propagation is performed, but only
on the subset of the network active for the input signal. This means that for
each link, only one sub link is active, and so back propagation is performed
only on that sub link; all other sub links in the link are ignored. When
weights are updated, they are only updated in the sub links that fired during
the forward step.

For clarity we include a description of the back propagation applied to
this network; recall that backpropagation is only applied to the active sub
network.

• Forward propagate input signal

• Record each sub link that is activated for the given input signal

• Back propagate error signal through active sub links

• Update weights of the active sub link

• Ensure that weights are within the desired range, $w_i = \min(w_{\max}, \max(w_{\min}, w_i))$

• Repeat with a new input signal
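
The steps above can be sketched for the simplest possible network, a single link trained online. This is an illustration under stated assumptions, not the code used for the results: a plain gradient step with a fixed learning rate `lr` stands in for the parameter-free update of [30], sub link ranges are assumed equally spaced and mapped to a local coordinate in $[-1, 1]$, and the names `basis` and `train_step` are hypothetical.

```python
import numpy as np

def basis(t, n_p):
    """Lagrange basis on n_p Chebyshev-Lobatto nodes, evaluated at local t."""
    nodes = -np.cos(np.pi * np.arange(n_p) / (n_p - 1))
    return np.array([
        np.prod([(t - nodes[m]) / (nodes[j] - nodes[m])
                 for m in range(n_p) if m != j])
        for j in range(n_p)
    ])

def train_step(weights, x, y_d, r_min, r_max, lr=0.1, w_min=-1.0, w_max=1.0):
    """One online step on a single link; weights has shape (n_sub, n_p)."""
    n_sub, n_p = weights.shape
    width = (r_max - r_min) / n_sub
    # Forward pass: record which sub link fires for this input.
    s = min(max(int((x - r_min) // width), 0), n_sub - 1)
    t = 2.0 * (x - (r_min + s * width)) / width - 1.0
    b = basis(t, n_p)
    y = weights[s] @ b
    # Backward pass: dE/dw_a = (y - y_d) * B_a(t), for the active sub link only.
    weights[s] -= lr * (y - y_d) * b
    # Keep the updated weights inside [w_min, w_max].
    np.clip(weights[s], w_min, w_max, out=weights[s])
    return s, 0.5 * (y - y_d) ** 2
```

Calling `train_step` repeatedly with samples `(x, sin(x))` fits a single link to a sine wave; idle sub links are never read or written during a step, which is where the sparsity of the method comes from.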

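The error at the output of the network is measured as

$$E = \frac{1}{2} \sum_i (y_i - y_{d,i})^2$$

The derivative of a link output with respect to a link input is given by

$$\frac{\partial y_i}{\partial x_i} = \frac{\partial F_i}{\partial x_i}$$

Although $F_i$ is discontinuous at points, it is continuous within the sub
link range where the derivative is evaluated. If the derivative happens to be
required exactly at the discontinuity, the derivative is taken only in the
activated sub link (either left or right of the discontinuity). The derivative
across a unit to one of the unit input links is

$$\frac{\partial x_i}{\partial y_j} = \frac{1}{N_{in}}$$

The derivative of the link output with respect to a weight is

$$\frac{\partial y_j}{\partial w_{a,j}} = \frac{\partial F_j}{\partial w_{a,j}}$$

Error at the top of the link in the output link

$$\delta E_i = \frac{\partial E}{\partial x_i} = \frac{\partial y_i}{\partial x_i} \frac{\partial E}{\partial y_i} = \frac{\partial y_i}{\partial x_i} (y_i - y_{d,i})$$

Error at the top of the link one layer in

$$\delta E_j = \frac{\partial E}{\partial x_j} = \frac{\partial y_j}{\partial x_j} \frac{\partial E}{\partial y_j} = \frac{\partial y_j}{\partial x_j} \sum_i \frac{\partial x_i}{\partial y_j} \frac{\partial E}{\partial x_i} = \frac{\partial y_j}{\partial x_j} \sum_i \frac{\partial x_i}{\partial y_j}\, \delta E_i$$

In the specific case of an averaging unit, the error is

$$\delta E_j = \frac{\partial y_j}{\partial x_j} \frac{1}{N_{in}} \sum_i \delta E_i$$

The rule one layer in can be applied to all further layers. The error in the
weight is then

$$\frac{\partial E}{\partial w_{a,j}} = \frac{\partial y_j}{\partial w_{a,j}} \frac{\partial E}{\partial y_j} = \frac{\partial y_j}{\partial w_{a,j}} \frac{1}{N_{in}} \sum_i \delta E_i$$

At this point, all weight update rules that work for standard neural networks
will work for this algorithm as well. A simple example is the momentum
update

$$w_{a,j}^{n+1} = w_{a,j}^{n} - \eta \frac{\partial E}{\partial w_{a,j}} + \gamma \left( w_{a,j}^{n} - w_{a,j}^{n-1} \right)$$

where $\eta$ is the learning rate and $\gamma$ is the momentum coefficient.
Although the simple update works for this problem, we instead use Schaul’s
parameter-free update [30].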

Figure 6: Two stacked bundles and their associated units. In this case each bundle has
two sub links and therefore a signal can take one of two paths for each bundle.


2.2. Accelerated Backpropagation 

The backpropagation algorithm described is the standard algorithm
applied to this model. It was pointed out by [1] that a simple modification
to the back propagation algorithm distributes the error more evenly among
units. In particular, the algorithm shown suggests that the error at the unit
is $1/N_{in}$ times the error at the top of the output links (the links leaving
the unit). The error at the output of the unit is distributed evenly over the
input links, and therefore the more input links there are, the smaller the
error that each input link receives. A simple way to remedy this problem is to
replace the back propagation factor $1/N_{in}$ (one over the number of active
links entering a unit) with $1/N_{out}$ (one over the number of links leaving a
unit), which will result in a more even distribution of weight change
throughout the network and faster convergence. This technique is used in this
paper.

Figure 7: The total number of paths for two bundles stacked on top of one another, with
2 sub links each, is 4. The 4 paths (Path 1 to Path 4) are drawn out in this diagram. Only
the active path is used in the back propagation algorithm for each signal. This means that
the network is actually 4 separate coupled networks. The coupling occurs because some of
the weights are shared between networks (whenever part of a path is shared, the weights
are shared).

2.3. Weight Initialization

Weight initialization is a huge concern for neural networks [31]. In this
paper, the weights within a link are initialized such that $F(x)$ (Equation (2))
is a line across the link; the equation is given as

$$w_i = \frac{x_i}{r_{\max} - r_{\min}}\, \omega_a \qquad (3)$$

where $\omega_a$ is chosen with a uniform random distribution in the range $[-1, 1]$.

2.4. Choosing Ranges $r_{\min}$ and $r_{\max}$

Choosing proper ranges for the links is key to getting good solutions. The
weights used in the Lagrange polynomial interpolation do not mark the limits
of the polynomial value except in the linear case. Figure 8 shows 5 Lagrange
polynomials with maximum overshoot for weights within the desired range;
note that only in the linear case is the range limited by the values of the
weights. Instead, if the weights are in the range $[-w_{\max}, w_{\max}]$ then a
given choice of weights will produce a maximum overshoot $P_{n,\max} w_{\max}$,
where values of $P_{n,\max}$ are given in Table 2. For a given input, the maximum
possible output would be $P_{n,\max} w_{\max}$, which could then be the input to
the next link. The input range for each link should be set to
$[-P_{n,\max} w_{\max}, P_{n,\max} w_{\max}]$ to account for these overshoots. One
potential consequence of this choice of range is input value decay. Consider a
deep network with only one unit per layer and one link between units. Each
layer is initialized using Equation (3) with $\omega_a = 1$ and input range
$[-P_{n,\max}, P_{n,\max}]$. Suppose the input value is $x_{in}$; then as the input
value passes through each layer it will be compressed, if
$r_{\max} = P_{n,\max} > w_{\max}$, by the ratio $w_{\max}/r_{\max}$. After passing
through $n$ layers, the output signal will be $(w_{\max}/r_{\max})^n x_{in}$. As a
consequence, the output from each layer is pushed towards the value 0, which
means fewer and fewer of the network sub links are used. Fortunately, two
things can occur to help the situation: (1) if a discontinuity is at the
origin, rapid learning can still occur, since a signal on either side of 0
adjusts disconnected weights; (2) as the network is trained, more and more of
the sub links are used. It would seem that randomly initializing the weights
could alleviate this problem; however, we have found that symmetric
initialization as in Equation (3) works better. In this paper we use
$r_{\max} = -r_{\min} = P_{n,\max}$, where $P_{n,\max}$ is provided in Table 2, which
gives the maximum overshoot for the polynomials when the weights are
constrained to be between $-1$ and 1.

    number of points   2   3      4       5       6       7       8       9       10
    P_{n,max}          1   1.25   1.667   1.799   1.989   2.083   2.203   2.275   2.362

Table 2: Maximum overshoot fraction, $P_{n,\max}$, given the number of points in the
polynomial interpolation (to 4 digits rounded up).



Figure 8: Lagrange polynomials with weights in the range [-1, 1] chosen for maximum
overshoot. Even though the weights are limited to values in the range [-1, 1] the function
can produce values outside this range. The next link then should have an input range that
accounts for this overshoot.
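
The overshoot values can be checked numerically with the sketch below. It assumes that the maximum overshoot is the largest value of $|\sum_j w_j B_j(x)|$ over all weights in $[-1, 1]$ and all $x$ in $[-1, 1]$; for a fixed $x$ the worst case is $w_j = \pm 1$ with the sign of $B_j(x)$, so the quantity reduces to the maximum of $\sum_j |B_j(x)|$. The function name `max_overshoot` and the sampling density are illustrative choices.

```python
import numpy as np

def max_overshoot(n_p, samples=20001):
    """Estimate the largest |f(x)| on [-1, 1] when every weight lies in [-1, 1]."""
    nodes = -np.cos(np.pi * np.arange(n_p) / (n_p - 1))
    x = np.linspace(-1.0, 1.0, samples)
    total = np.zeros_like(x)
    for j in range(n_p):
        b = np.ones_like(x)
        for m in range(n_p):
            if m != j:
                b *= (x - nodes[m]) / (nodes[j] - nodes[m])
        # Worst case at each x: w_j = sign(B_j(x)), so the terms add as |B_j(x)|.
        total += np.abs(b)
    return float(total.max())

if __name__ == "__main__":
    for n_p in range(2, 11):
        print(n_p, round(max_overshoot(n_p), 4))
```

Under this assumption the estimate is 1 for two points and 1.25 for three points, in agreement with the first entries of Table 2.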

2.5. Automatic Normalization

Since input links and output links are used in this network and their ranges
and weights are arbitrary, data does not need to be normalized before being
passed to the network. Normalization automatically occurs for input links by
the user’s choice of $r_{\min}$ and $r_{\max}$ for the input links, and by the
user’s choice of $w_{\min}$ and $w_{\max}$ for the output links (see Figure 1 for
definitions of input and output links). For example, if the MNIST data is
being used and image values range from 0 to 255, simply set $r_{\min} = 0$ and
$r_{\max} = 255$ in the input link. Similarly, if the output values should be
between 0 and 100 then set the weight limits in the output link to
$w_{\min} = 0$ and $w_{\max} = 100$. Although not particularly important, this is a
convenient way to deal with input and output without a separate
normalization step.
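
As a small illustration, a linear link with fixed end weights performs exactly this rescaling. The sketch assumes a single linear sub link ($N_p = 2$) spanning the whole range, with an internal signal range of $[-1, 1]$ for the output link; the function name `linear_link` is hypothetical.

```python
def linear_link(x, r_min, r_max, w0, w1):
    """Linear link: maps [r_min, r_max] onto [w0, w1] via the two Lagrange weights."""
    t = 2.0 * (x - r_min) / (r_max - r_min) - 1.0  # local coordinate in [-1, 1]
    return 0.5 * (1.0 - t) * w0 + 0.5 * (1.0 + t) * w1

# Input link for 8-bit pixel values: [0, 255] is mapped to [-1, 1].
pixel_signal = linear_link(255.0, 0.0, 255.0, -1.0, 1.0)

# Output link: an internal signal in [-1, 1] is mapped to [0, 100].
scaled_output = linear_link(0.0, -1.0, 1.0, 0.0, 100.0)
```

Here `pixel_signal` is 1.0 and `scaled_output` is 50.0, so the rescaling is carried by the link itself rather than by a preprocessing step.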

3. Results 

In this section we use the network on 3 problems. The first is a simple
sine wave, to illustrate how the functional link approximates this function.
This clarifies how the link in this network differs from the single
weight used in the standard approach. The second problem uses the MAGIC
gamma ray detection data set to predict whether a gamma ray has been
detected or not. Finally, results for the MNIST problem are computed with 10
autoencoder classifiers (one for each digit) in a manner similar to that
described in [15], though using the reconstruction error to determine the
digit. All results are computed using online backpropagation. In these
problems, all input and output links have fixed weights and the link function
is linear with $w_0 = -1$ and $w_n = 1$. Input links have a range
$[r_{\min}, r_{\max}]$ dependent on the range of the input parameters. 50% dropout
is used in all cases except the sine wave, which only uses one link.

3.1. Sine Wave 

A sine wave is approximated using a single link with several sub links,
showing the function approximation capability of a single link. This problem
illustrates how the functions in each sub link are combined to produce
the full function. Figure 9 shows a sine wave approximated using a bundle
with 3 and 5 linear sub links where discontinuity is allowed between the
sub links. Figure 10 shows the same sine wave approximated using a link
with 2 and 3 quadratic sub links with discontinuity between them. Note
that the linear approximation has 6 and 10 degrees of freedom (for 3 and 5
ranges) and the quadratic approximation has 6 and 9 degrees of freedom (for
2 and 3 ranges). Despite having the same or fewer degrees of freedom, the
quadratic approximation is substantially better than the linear
approximation. This illustrates the fact that higher order polynomial
approximations require fewer degrees of freedom (for the same accuracy) than
lower order approximations during function approximation.

Figure 9: A single link used to approximate a sine wave with piecewise linear sub links
($N_p = 2$). As the number of elements increases the fit improves. Discontinuities are visible
at sub link boundaries; the level of discontinuity decreases as the number of sub links is
increased.

Figure 10: A single link used to approximate a sine wave with piecewise quadratic elements
($N_p = 3$). In this case we get a much better match than the case with piecewise linear
elements even with fewer total weights. This is a well-known feature of higher order
approximations.
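
The degrees-of-freedom comparison can be reproduced approximately with an independent least-squares fit on each sub link range. This sketch is not the trained link from the figures: it assumes one period of the sine wave on $[-\pi, \pi]$, equally spaced ranges, and a direct least-squares solve in place of online backpropagation. The function name `piecewise_fit_rms` is hypothetical.

```python
import numpy as np

def piecewise_fit_rms(n_sub, degree, samples=2000):
    """RMS error of independent least-squares fits on n_sub equal ranges of [-pi, pi]."""
    edges = np.linspace(-np.pi, np.pi, n_sub + 1)
    sq_err = []
    for lo, hi in zip(edges[:-1], edges[1:]):
        x = np.linspace(lo, hi, samples)
        t = 2.0 * (x - lo) / (hi - lo) - 1.0  # local coordinate of the sub link
        coeffs = np.polyfit(t, np.sin(x), degree)
        sq_err.append((np.polyval(coeffs, t) - np.sin(x)) ** 2)
    return float(np.sqrt(np.mean(sq_err)))

if __name__ == "__main__":
    for n_sub, degree in [(3, 1), (5, 1), (2, 2), (3, 2)]:
        dof = n_sub * (degree + 1)
        print(f"sub links={n_sub} order={degree} dof={dof} "
              f"rms={piecewise_fit_rms(n_sub, degree):.4f}")
```

Each sub link contributes `degree + 1` weights, so 3 linear and 2 quadratic sub links both have 6 degrees of freedom, which makes the two fits directly comparable.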

3.2. Gamma Ray Detection

The gamma ray detection data was created to simulate the detection of
gamma rays in a ground-based atmospheric Cerenkov gamma ray telescope.
There are 19,020 examples, where 2/3 of the data is used for training and
validation and the remaining 1/3 is used for testing. The data is shuffled
10 times, using a different set for training and test each time. In the
validation runs, 10% of the training data is used as the validation set.

The data can be obtained from the University of California Irvine
repository [21]. In the results that follow, a parameter scan was performed
over the network geometry, with [3, 5, 10, 20, 80, 160] neurons per layer and
[1, 2, 3, 4, 5, 6, 7, 8] layers; the number of sub links was varied from 1 to 6
and $N_p$ from 2 to 6, with each configuration run for one epoch. The solution
that produced the best validation accuracy was the case with 80 units in the
hidden layers and 2 sub links. Using this network geometry we then ran
$N_p = 2$ to 6 for 50 epochs, using 10 separate test and training sets. The
results are shown in Table 3. Table 5 shows previously published results
using a decision tree with softening splits [3] for comparison. Table 4 gives
the confidence scores as a percent chance that the means of the $N_p$ in the
columns and the means of the $N_p$ in the rows actually match for the results
of Table 3.

Table 3 shows that piecewise linear $N_p = 2$ does not perform nearly as
well as the higher $N_p$ solutions. The test error on this problem reaches a
minimum around $N_p = 5$. In particular, the reduction in test error from
$N_p = 2$ to $N_p = 3$ is about 29% (0.2065 to 0.1474). A key difference between
the piecewise linear case $N_p = 2$ and the higher order interpolations is that
even after $N$ layers are added to the network, the output of a piecewise
linear network is still piecewise linear, whereas if an $n$th order polynomial
is used (for $n > 1$) as the link function, the output polynomial function is
of order $n^k$, where $k$ is the number of layers of links.
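
The growth in polynomial order can be checked by composing a polynomial with itself. The sketch below considers a single path through the network, with one active sub link per layer and no averaging; the function name `composed_order` and the choice of an all-ones coefficient vector are illustrative.

```python
import numpy as np

def composed_order(n, k):
    """Order of an order-n polynomial composed with itself across k layers of links."""
    p = np.poly1d(np.ones(n + 1))  # a generic polynomial of order n
    out = np.poly1d([1.0, 0.0])    # the identity, i.e. the raw input signal
    for _ in range(k):
        out = p(out)               # calling a poly1d on a poly1d composes them
    return out.order

if __name__ == "__main__":
    print(composed_order(1, 4))  # linear links stay linear along a path
    print(composed_order(2, 3))  # quadratic links over 3 layers
```

For linear links the order stays at 1 however many layers are stacked, while quadratic links over 3 layers give order $2^3 = 8$. Averaging at the units adds polynomials of the same order, so it does not change this count.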

    N_p   test error   σ
    2     0.2065       0.0225
    3     0.1474       0.0076
    4     0.1368       0.0053
    5     0.1311       0.0039
    6     0.1318       0.0041

Table 3: Results on MAGIC data set based on the average of 10 splits of the original data
set. Each case was run for 50 epochs on a fully connected network with 4 hidden layers
of 50 neurons each, and 2 sub links per link. σ is the standard deviation of the averaged
results.


    N_p   2        3        4     5
    3     1.0e-3
    4     3.1e-4   3.8e-1
    5     2.9e-4   6.2e-3   2.4
    6     3.2e-4   1.0e-2   4.9   100

Table 4: T-test table showing the % risk that the means given in Table 3 are actually
equal. Note that all scores are less than 5% except the case comparing 6 and 5. What
this table shows is that higher order accuracy produces significantly better results in this
problem until $N_p = 6$, where the difference is insignificant.


    test error   σ
    0.1376       0.00087

Table 5: Results copied from [3], which used a decision tree with softening splits. In [3]
the original data set was split randomly into 7 different test cases. Here those results are
averaged and the standard deviation computed. The results indicate that the
discontinuous polynomial neural network is competitive for $N_p > 3$ when compared with
Table 3. σ is the standard deviation of the results.

3.3. Handwritten Digit Recognition

MNIST is a standard benchmark for neural network codes on optical
character recognition [19]. The MNIST data set consists of a 60,000 image
training set of the 10 digits, written by Census Bureau employees and high
school students. In addition there is a 10,000 image test set that is used to
test the generalization capability of the network after the network is
trained on the training set. During parameter scans the training set is
randomly split into training and validation groups, with 50,000 examples in
the training groups and 10,000 examples in the validation groups. The images
are 28×28 pixel, 1 byte/pixel gray scale images. As such, there are 784 input
links (one for each input pixel) and 784 output links. 3 hidden layers are
used in the results presented, with a width of 7 as defined in Figure 11. The
digit recognition problem is solved using an autoencoder for each digit.
There are 10 autoencoders, one trained on each of the 10 digits: autoencoder
0 is trained only on the digit 0 examples, autoencoder 1 only on the digit 1
examples, and so on. At the end, the test set examples are run through each
autoencoder, and the predicted digit is determined by the autoencoder that
produces the smallest error for the given input. The error used in this paper
is reconstruction error; improvements in this error measure for this approach
to classification are explored in [15].

Parameter scans were performed by varying the number of sub links from
1 to 14. The width was varied from 2 to 7 and $N_p$ was varied from 2 to
6. The best configuration validation results are presented in Table 6, with
results based on 10 different training and validation sets obtained by
shuffling the complete 60,000 example training set.


• Np - the number of Chebyshev-Lobatto points in the interpolation.

• layers - the number of neurons in each hidden layer.

• sub links - the number of sub links in each link.

• width - the width of the stencil as defined in Figure 11.


Table shows MNIST results for Np = 2 to Q varying the number of 
neighbors used in the multi-layer antoencoder. At a width of 7, a signal from 
the upper left corner of the input image is able to interact with the signal 
from the lower right. T-test were performed to measure the signihcance of 
the difference in the measured means, see Table In this particular problem 
it was shown that all cases with Np > 2 were much better than the case with 


23 





Figure 11: Nearest neighbor connectivity used on the MNIST problem. In this case the 
unit in the lower layer is connected to the N nearest neighbors in the previous layer. In 
the above example there are 9 nearest neighbors and the width of the stencil is defined 
to be 1. A stencil with width=2 would have 25 nearest neighbors. 
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Np 

sub links 

validation error 

a 

2 

12 

0.07124 

0.00276 

3 

10 

0.05063 

0.00293 

4 

10 

0.04804 

0.00216 

5 

6 

0.04762 

0.00271 

6 

6 

0.04879 

0.00295 


Table 6: Autoencoder MNIST validation results running for 1 epoch based on the best 
network parameters for the given Np. Note that 1 epoch is equivalent to presenting 5000 
examples for each of the autoencoders. The 3 point interpolation performs significantly 
better than the 2 point interpolation, and the performance increases to Np = 5. The best 
average performance is achieved for Np = 5. 


Np    2          3      4      5
3     1.5e-9
4     2.3e-11    6.1
5     8.6e-11    4.7    100
6     4.0e-10    29     83     60

Table 7: Percent risk that the means in Table 6 are actually equal for each pair of Np
values, which is a measure of the significance of the difference of the means. This table
shows that Np = 6 is not significantly different from Np = 3, 4, 5 using 5% risk. Np = 5,
however, is significantly better than Np = 2, 3 assuming a risk of 5%.
Np    sub links    training error    test error
5     6            0.007183          0.0145

Table 8: Autoencoder MNIST test results running for 50 epochs based on the best network
parameters in Table 6. In this case the full training set was used for training.


Np = 2. In addition, Np = 4 and Np = 5 were significantly better than Np = 3
assuming a risk of 7%. However, the difference between Np = 4 and Np = 5 is
statistically insignificant. Also, note that Np = 6 is not significantly better
than Np = 3.
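
A comparison of this kind can be sketched from the summary statistics alone. The example below computes a Welch t statistic and its approximate degrees of freedom from the validation errors in Table 6. It assumes that the last column of that table is a sample standard deviation over the 10 shuffles. The t-test variant actually used is not stated, so this sketch is not expected to reproduce the risk values in Table 7 exactly.

```python
import math

# Np -> (mean validation error, spread), copied from Table 6.
# Treating the spread as a sample standard deviation is an assumption.
table6 = {
    2: (0.07124, 0.00276),
    3: (0.05063, 0.00293),
    4: (0.04804, 0.00216),
    5: (0.04762, 0.00271),
    6: (0.04879, 0.00295),
}


def welch_t(mean_a, std_a, mean_b, std_b, n=10):
    """Welch t statistic and Welch-Satterthwaite degrees of freedom for two samples of size n."""
    var_a = std_a ** 2 / n  # squared standard error of mean a
    var_b = std_b ** 2 / n  # squared standard error of mean b
    t = (mean_a - mean_b) / math.sqrt(var_a + var_b)
    dof = (var_a + var_b) ** 2 / (var_a ** 2 / (n - 1) + var_b ** 2 / (n - 1))
    return t, dof


# A larger |t| means the two means are less likely to be equal
for a, b in [(2, 3), (3, 4), (4, 5)]:
    t, dof = welch_t(*table6[a], *table6[b])
    print(f"Np={a} vs Np={b}: t={t:.2f}, dof={dof:.1f}")
```

Converting t to a percent risk requires the cumulative distribution of Student's t, which is omitted here to keep the sketch free of external dependencies.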

Recall that for larger Np not only does the order of accuracy increase, but
also the amount of compression (see Table ). This means that although the
order of accuracy is increasing, the network inputs may be compressed
to a smaller region of the polynomial, which reduces the effectiveness
of increasing the order of accuracy. In Table 6 the number of sub links is
always an even number; this is because an odd number of sub links typically
gives a worse solution than an even number. It is thought that there are 2
causes for this. (1) In deep networks the initial weight values are random
values about 0 and the output of each link is averaged at the unit, which tends
to focus that output around the value 0. As the number of layers is increased
the output of each unit approaches 0 more closely. (2) If an odd number of sub
links is used, the link function is smooth at 0. If an even number of sub
links is used then there is a discontinuity at 0. The discontinuity near
the origin means that the solution can rapidly jump between different sub
networks and prevents the much slower convergence that is observed with an
odd number of sub links. As such, it is important that a discontinuity should
occur at the origin.
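
The even/odd argument can be checked with a short sketch. It assumes that the input range of a link is symmetric about 0 (taken here as [-1, 1]) and that the sub links are evenly spaced; both the range and the function names are assumptions made for illustration.

```python
def sub_link_edges(n_sub_links, lo=-1.0, hi=1.0):
    """Boundaries of n evenly spaced sub links covering [lo, hi]."""
    width = (hi - lo) / n_sub_links
    return [lo + i * width for i in range(n_sub_links + 1)]


def has_break_at_origin(n_sub_links, tol=1e-12):
    """True if an interior sub link boundary (a discontinuity) lies at 0."""
    interior = sub_link_edges(n_sub_links)[1:-1]
    return any(abs(edge) < tol for edge in interior)


# Even counts place a boundary at the origin, odd counts do not
for n in range(1, 9):
    print(n, has_break_at_origin(n))
```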

Table 8 shows the best case from the validation tests in Table 6 run out to
50 epochs. Here we achieve a test error rate of 1.45%, which can be compared
with [15] and na which show results of 1.27% and 1.1% respectively. Despite
the performance discrepancy on this problem, we believe that there are
many options for improving performance. The simplest thing
to do would be to add in the non-linearity at the neuron. In this case, to
keep the polynomial nature of the network, we suggest using the rectified
linear unit. In addition, as the network approaches convergence the network
for a given input settles on a single sub network. In this case, that network
can be described by a smooth polynomial. This suggests that a non-convex
optimization technique should be used (even for the case of 0 discontinuities)
and one might be able to take advantage of techniques specifically designed
for high order polynomials such as im or [18]. Techniques for improving the
performance are left as future work.
Np    sub links    tNp (seconds)    tNp/tN2
2     2            50               1
3     2            61               1.2
4     2            77               1.5
5     2            90               1.8
6     2            108              2.2


Table 9: Timing results for equal number of sub links for 1000 inputs.


3.4. Timing Results

A critical question is how much computational time is added for Np > 2. 
Both the derivatives and function evaluations become more complex as Np 
increases. Furthermore, it might seem that adding additional sub links should 
increase the computational time. Theoretically, adding new sub links does 
not increase the time for a single iteration. With evenly spaced sub links the 
active sub link is determined in O(1) time, so it’s independent of the number
of sub links. However, adding new sub links does change the memory usage 
and structure and can increase inefficiencies that way. 
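
The constant-time lookup can be sketched as follows, assuming evenly spaced sub links on an input range [lo, hi] (taken here as [-1, 1]). The range, the clamping behavior at the ends, and the function name are assumptions made for illustration.

```python
def active_sub_link(x, n_sub_links, lo=-1.0, hi=1.0):
    """Index of the sub link whose interval contains x, found without a search."""
    # Rescale x so that each sub link spans one unit, then truncate
    scaled = (x - lo) / (hi - lo) * n_sub_links
    # Clamp so that x == hi (or values slightly outside the range) map to a valid index
    return min(max(int(scaled), 0), n_sub_links - 1)


# The cost is one multiply and one truncation, whatever the number of sub links
for n in (2, 6, 12):
    print(n, [active_sub_link(x, n) for x in (-1.0, -0.01, 0.0, 0.99, 1.0)])
```

Only the weights of the returned sub link are then read, which is why the arithmetic cost does not grow with the number of sub links even though the memory footprint does.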

In Tables 9 and 10 we run the MNIST problem as above with 5
layers and width 6, with the other parameters specified in the table. The
results were computed by running the test case 10 times and averaging the
main loop time. Adding additional sub links is not cost free: Np = 2 with
6 sub links is 12% slower than Np = 2 with 2 sub links, but has 3 times
the degrees of freedom. This is despite the fact that only one sub link is ever
active. In addition, increasing the number of sub links to 12 with Np = 2
makes the time jump to 111 seconds (122% slower than with 2 sub links) for
1000 iterations. Despite this, the user still gets greater computational
power without significantly increasing computational time. Table 9 shows
that Np = 6 is only 2.2 times slower than Np = 2 despite having 3 times the
degrees of freedom. Similarly, in Table 10, with the same number of degrees
of freedom, Np = 6 is only 1.83 times slower on average.







Np    sub links    tNp (seconds)    tNp/tN2
2     6            59               1.0
3     4            65               1.1
4     3            90               1.5
6     2            108              1.8


Table 10: Timing results for equal number of degrees of freedom for 1000 inputs. 
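
The arithmetic behind the two timing tables can be restated in a few lines. The sketch below assumes that the degrees of freedom per link equal Np times the number of sub links, which is inferred from the ratios quoted in the text rather than stated explicitly. The variable and function names are illustrative.

```python
# (Np, sub links, seconds) copied from Tables 9 and 10
table9 = [(2, 2, 50), (3, 2, 61), (4, 2, 77), (5, 2, 90), (6, 2, 108)]
table10 = [(2, 6, 59), (3, 4, 65), (4, 3, 90), (6, 2, 108)]


def summarize(rows):
    """Return (Np, sub links, degrees of freedom per link, time relative to the first row)."""
    base_time = rows[0][2]
    return [(n_p, subs, n_p * subs, round(t / base_time, 2)) for n_p, subs, t in rows]


for name, rows in (("Table 9", table9), ("Table 10", table10)):
    print(name)
    for row in summarize(rows):
        print("  Np=%d sub links=%d dof=%d ratio=%.2f" % row)
```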
4. Conclusion

A novel approach to artificial neural networks is described where the
traditional neuronal non-linearity is eliminated in favor of a discontinuous
piecewise polynomial discretization of the weight space of each link. The use of
discontinuous piecewise polynomial approximations leads to a network that
is the superposition of multiple networks with a set of shared weights, as only
a subset of the total network is active for each input signal. Standard
backpropagation is used for error correction with the modification that sub links
that do not fire are not included in the backpropagation step. The dropout
technique [12] is used to minimize overfitting. It is found that piecewise
quadratic polynomials generally produce much better results than piecewise
linear for the same number of degrees of freedom and that moving to
increasingly higher order polynomials can provide additional improvement. We have
successfully demonstrated good solutions to the MAGIC and MNIST data
sets and expect that more complicated problems can be solved as well using this
algorithm. Future work will include the addition of a neuronal non-linearity
as well as investigation of non-convex optimization techniques.